{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n\nGet Started with VTA\n====================\n**Author**: `Thierry Moreau <https://homes.cs.washington.edu/~moreau/>`_\n\nThis is an introduction tutorial on how to use TVM to program the VTA design.\n\nIn this tutorial, we will demonstrate the basic TVM workflow to implement\na vector addition on the VTA design's vector ALU.\nThis process includes specific scheduling transformations necessary to lower\ncomputation down to low-level accelerator operations.\n\nTo begin, we need to import TVM which is our deep learning optimizing compiler.\nWe also need to import the VTA python package which contains VTA specific\nextensions for TVM to target the VTA design.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from __future__ import absolute_import, print_function\n\nimport os\nimport tvm\nfrom tvm import te\nimport vta\nimport numpy as np"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Loading in VTA Parameters\n~~~~~~~~~~~~~~~~~~~~~~~~~\nVTA is a modular and customizable design. Consequently, the user\nis free to modify high-level hardware parameters that affect\nthe hardware design layout.\nThese parameters are specified in the :code:`vta_config.json` file by their\n:code:`log2` values.\nThese VTA parameters can be loaded with the :code:`vta.get_env`\nfunction.\n\nFinally, the TVM target is also specified in the :code:`vta_config.json` file.\nWhen set to *sim*, execution will take place inside of a behavioral\nVTA simulator.\nIf you want to run this tutorial on the Pynq FPGA development platform,\nfollow the *VTA Pynq-Based Testing Setup* guide.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "env = vta.get_env()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "FPGA Programming\n----------------\nWhen targeting the Pynq FPGA development board, we need to configure\nthe board with a VTA bitstream.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# We'll need the TVM RPC module and the VTA simulator module\nfrom tvm import rpc\nfrom tvm.contrib import utils\nfrom vta.testing import simulator\n\n# We read the Pynq RPC host IP address and port number from the OS environment\nhost = os.environ.get(\"VTA_RPC_HOST\", \"192.168.2.99\")\nport = int(os.environ.get(\"VTA_RPC_PORT\", \"9091\"))\n\n# We configure both the bitstream and the runtime system on the Pynq\n# to match the VTA configuration specified by the vta_config.json file.\nif env.TARGET == \"pynq\" or env.TARGET == \"de10nano\":\n\n    # Make sure that TVM was compiled with RPC=1\n    assert tvm.runtime.enabled(\"rpc\")\n    remote = rpc.connect(host, port)\n\n    # Reconfigure the JIT runtime\n    vta.reconfig_runtime(remote)\n\n    # Program the FPGA with a pre-compiled VTA bitstream.\n    # You can program the FPGA with your own custom bitstream\n    # by passing the path to the bitstream file instead of None.\n    vta.program_fpga(remote, bitstream=None)\n\n# In simulation mode, host the RPC server locally.\nelif env.TARGET in (\"sim\", \"tsim\", \"intelfocl\"):\n    remote = rpc.LocalSession()\n\n    if env.TARGET in [\"intelfocl\"]:\n        # program intelfocl aocx\n        vta.program_fpga(remote, bitstream=\"vta.bitstream\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Computation Declaration\n-----------------------\nAs a first step, we need to describe our computation.\nTVM adopts tensor semantics, with each intermediate result\nrepresented as multi-dimensional array. The user needs to describe\nthe computation rule that generates the output tensors.\n\nIn this example we describe a vector addition, which requires multiple\ncomputation stages, as shown in the dataflow diagram below.\nFirst we describe the input tensors :code:`A` and :code:`B` that are living\nin main memory.\nSecond, we need to declare intermediate tensors :code:`A_buf` and\n:code:`B_buf`, which will live in VTA's on-chip buffers.\nHaving this extra computational stage allows us to explicitly\nstage cached reads and writes.\nThird, we describe the vector addition computation which will\nadd :code:`A_buf` to :code:`B_buf` to produce :code:`C_buf`.\nThe last operation is a cast and copy back to DRAM, into results tensor\n:code:`C`.\n\n![](https://raw.githubusercontent.com/uwsampl/web-data/main/vta/tutorial/vadd_dataflow.png)\n\n     :align: center\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Input Placeholders\n~~~~~~~~~~~~~~~~~~\nWe describe the placeholder tensors :code:`A`, and :code:`B` in a tiled data\nformat to match the data layout requirements imposed by the VTA vector ALU.\n\nFor VTA's general purpose operations such as vector adds, the tile size is\n:code:`(env.BATCH, env.BLOCK_OUT)`.\nThe dimensions are specified in\nthe :code:`vta_config.json` configuration file and are set by default to\na (1, 16) vector.\n\nIn addition, A and B's data types also needs to match the :code:`env.acc_dtype`\nwhich is set by the :code:`vta_config.json` file to be a 32-bit integer.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Output channel factor m - total 64 x 16 = 1024 output channels\nm = 64\n# Batch factor o - total 1 x 1 = 1\no = 1\n# A placeholder tensor in tiled data format\nA = te.placeholder((o, m, env.BATCH, env.BLOCK_OUT), name=\"A\", dtype=env.acc_dtype)\n# B placeholder tensor in tiled data format\nB = te.placeholder((o, m, env.BATCH, env.BLOCK_OUT), name=\"B\", dtype=env.acc_dtype)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Copy Buffers\n~~~~~~~~~~~~\nOne specificity of hardware accelerators, is that on-chip memory has to be\nexplicitly managed.\nThis means that we'll need to describe intermediate tensors :code:`A_buf`\nand :code:`B_buf` that can have a different memory scope than the original\nplaceholder tensors :code:`A` and :code:`B`.\n\nLater in the scheduling phase, we can tell the compiler that :code:`A_buf`\nand :code:`B_buf` will live in the VTA's on-chip buffers (SRAM), while\n:code:`A` and :code:`B` will live in main memory (DRAM).\nWe describe A_buf and B_buf as the result of a compute\noperation that is the identity function.\nThis can later be interpreted by the compiler as a cached read operation.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# A copy buffer\nA_buf = te.compute((o, m, env.BATCH, env.BLOCK_OUT), lambda *i: A(*i), \"A_buf\")\n# B copy buffer\nB_buf = te.compute((o, m, env.BATCH, env.BLOCK_OUT), lambda *i: B(*i), \"B_buf\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Vector Addition\n~~~~~~~~~~~~~~~\nNow we're ready to describe the vector addition result tensor :code:`C`,\nwith another compute operation.\nThe compute function takes the shape of the tensor, as well as a lambda\nfunction that describes the computation rule for each position of the tensor.\n\nNo computation happens during this phase, as we are only declaring how\nthe computation should be done.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Describe the in-VTA vector addition\nC_buf = te.compute(\n    (o, m, env.BATCH, env.BLOCK_OUT),\n    lambda *i: A_buf(*i).astype(env.acc_dtype) + B_buf(*i).astype(env.acc_dtype),\n    name=\"C_buf\",\n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Casting the Results\n~~~~~~~~~~~~~~~~~~~\nAfter the computation is done, we'll need to send the results computed by VTA\nback to main memory.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<div class=\"alert alert-info\"><h4>Note</h4><p>**Memory Store Restrictions**\n\n  One specificity of VTA is that it only supports DRAM stores in the narrow\n  :code:`env.inp_dtype` data type format.\n  This lets us reduce the data footprint for memory transfers (more on this\n  in the basic matrix multiply example).</p></div>\n\nWe perform one last typecast operation to the narrow\ninput activation data format.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Cast to output type, and send to main memory\nC = te.compute(\n    (o, m, env.BATCH, env.BLOCK_OUT), lambda *i: C_buf(*i).astype(env.inp_dtype), name=\"C\"\n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "This concludes the computation declaration part of this tutorial.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Scheduling the Computation\n--------------------------\nWhile the above lines describes the computation rule, we can obtain\n:code:`C` in many ways.\nTVM asks the user to provide an implementation of the computation called\n*schedule*.\n\nA schedule is a set of transformations to an original computation that\ntransforms the implementation of the computation without affecting\ncorrectness.\nThis simple VTA programming tutorial aims to demonstrate basic schedule\ntransformations that will map the original schedule down to VTA hardware\nprimitives.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Default Schedule\n~~~~~~~~~~~~~~~~\nAfter we construct the schedule, by default the schedule computes\n:code:`C` in the following way:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Let's take a look at the generated schedule\ns = te.create_schedule(C.op)\n\nprint(tvm.lower(s, [A, B, C], simple_mode=True))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Although this schedule makes sense, it won't compile to VTA.\nIn order to obtain correct code generation, we need to apply scheduling\nprimitives and code annotation that will transform the schedule into\none that can be directly lowered onto VTA hardware intrinsics.\nThose include:\n\n - DMA copy operations which will take globally-scoped tensors and copy\n   those into locally-scoped tensors.\n - Vector ALU operations that will perform the vector add.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Buffer Scopes\n~~~~~~~~~~~~~\nFirst, we set the scope of the copy buffers to indicate to TVM that these\nintermediate tensors will be stored in the VTA's on-chip SRAM buffers.\nBelow, we tell TVM that :code:`A_buf`, :code:`B_buf`, :code:`C_buf`\nwill live in VTA's on-chip *accumulator buffer* which serves as\nVTA's general purpose register file.\n\nSet the intermediate tensors' scope to VTA's on-chip accumulator buffer\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "s[A_buf].set_scope(env.acc_scope)\ns[B_buf].set_scope(env.acc_scope)\ns[C_buf].set_scope(env.acc_scope)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "DMA Transfers\n~~~~~~~~~~~~~\nWe need to schedule DMA transfers to move data living in DRAM to\nand from the VTA on-chip buffers.\nWe insert :code:`dma_copy` pragmas to indicate to the compiler\nthat the copy operations will be performed in bulk via DMA,\nwhich is common in hardware accelerators.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Tag the buffer copies with the DMA pragma to map a copy loop to a\n# DMA transfer operation\ns[A_buf].pragma(s[A_buf].op.axis[0], env.dma_copy)\ns[B_buf].pragma(s[B_buf].op.axis[0], env.dma_copy)\ns[C].pragma(s[C].op.axis[0], env.dma_copy)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "ALU Operations\n~~~~~~~~~~~~~~\nVTA has a vector ALU that can perform vector operations on tensors\nin the accumulator buffer.\nIn order to tell TVM that a given operation needs to be mapped to the\nVTA's vector ALU, we need to explicitly tag the vector addition loop\nwith an :code:`env.alu` pragma.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Tell TVM that the computation needs to be performed\n# on VTA's vector ALU\ns[C_buf].pragma(C_buf.op.axis[0], env.alu)\n\n# Let's take a look at the finalized schedule\nprint(vta.lower(s, [A, B, C], simple_mode=True))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "This concludes the scheduling portion of this tutorial.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "TVM Compilation\n---------------\nAfter we have finished specifying the schedule, we can compile it\ninto a TVM function. By default TVM compiles into a type-erased\nfunction that can be directly called from python side.\n\nIn the following line, we use :code:`tvm.build` to create a function.\nThe build function takes the schedule, the desired signature of the\nfunction(including the inputs and outputs) as well as target language\nwe want to compile to.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "my_vadd = vta.build(s, [A, B, C], \"ext_dev\", env.target_host, name=\"my_vadd\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Saving the Module\n~~~~~~~~~~~~~~~~~\nTVM lets us save our module into a file so it can loaded back later. This\nis called ahead-of-time compilation and allows us to save some compilation\ntime.\nMore importantly, this allows us to cross-compile the executable on our\ndevelopment machine and send it over to the Pynq FPGA board over RPC for\nexecution.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Write the compiled module into an object file.\ntemp = utils.tempdir()\nmy_vadd.save(temp.relpath(\"vadd.o\"))\n\n# Send the executable over RPC\nremote.upload(temp.relpath(\"vadd.o\"))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Loading the Module\n~~~~~~~~~~~~~~~~~~\nWe can load the compiled module from the file system to run the code.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "f = remote.load_module(\"vadd.o\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Running the Function\n--------------------\nThe compiled TVM function uses a concise C API and can be invoked from\nany language.\n\nTVM provides an array API in python to aid quick testing and prototyping.\nThe array API is based on `DLPack <https://github.com/dmlc/dlpack>`_ standard.\n\n- We first create a remote context (for remote execution on the Pynq).\n- Then :code:`tvm.nd.array` formats the data accordingly.\n- :code:`f()` runs the actual computation.\n- :code:`numpy()` copies the result array back in a format that can be\n  interpreted.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Get the remote device context\nctx = remote.ext_dev(0)\n\n# Initialize the A and B arrays randomly in the int range of (-128, 128]\nA_orig = np.random.randint(-128, 128, size=(o * env.BATCH, m * env.BLOCK_OUT)).astype(A.dtype)\nB_orig = np.random.randint(-128, 128, size=(o * env.BATCH, m * env.BLOCK_OUT)).astype(B.dtype)\n\n# Apply packing to the A and B arrays from a 2D to a 4D packed layout\nA_packed = A_orig.reshape(o, env.BATCH, m, env.BLOCK_OUT).transpose((0, 2, 1, 3))\nB_packed = B_orig.reshape(o, env.BATCH, m, env.BLOCK_OUT).transpose((0, 2, 1, 3))\n\n# Format the input/output arrays with tvm.nd.array to the DLPack standard\nA_nd = tvm.nd.array(A_packed, ctx)\nB_nd = tvm.nd.array(B_packed, ctx)\nC_nd = tvm.nd.array(np.zeros((o, m, env.BATCH, env.BLOCK_OUT)).astype(C.dtype), ctx)\n\n# Invoke the module to perform the computation\nf(A_nd, B_nd, C_nd)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Verifying Correctness\n---------------------\nCompute the reference result with numpy and assert that the output of the\nmatrix multiplication indeed is correct\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Compute reference result with numpy\nC_ref = (A_orig.astype(env.acc_dtype) + B_orig.astype(env.acc_dtype)).astype(C.dtype)\nC_ref = C_ref.reshape(o, env.BATCH, m, env.BLOCK_OUT).transpose((0, 2, 1, 3))\nnp.testing.assert_equal(C_ref, C_nd.numpy())\nprint(\"Successful vector add test!\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Summary\n-------\nThis tutorial provides a walk-through of TVM for programming the\ndeep learning accelerator VTA with a simple vector addition example.\nThe general workflow includes:\n\n- Programming the FPGA with the VTA bitstream over RPC.\n- Describing the vector add computation via a series of computations.\n- Describing how we want to perform the computation using schedule primitives.\n- Compiling the function to the VTA target.\n- Running the compiled module and verifying it against a numpy implementation.\n\nYou are more than welcome to check other examples out and tutorials\nto learn more about the supported operations, schedule primitives\nand other features supported by TVM to program VTA.\n\n\n"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.9"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}